Improve performance 20%-110% #14

Merged
merged 13 commits into arangodb:master on Dec 2, 2019

Conversation

htmldoug
Contributor

@htmldoug htmldoug commented Sep 24, 2019

Summary

Adds simple benchmarks for VPackBuilder, VPackSlice, and VPackParser. Improves runtime performance by ~110% within VPackBuilder/VPackSlice. String => VPackSlice is dominated by json parsing and remains about the same.

Results

https://jmh.morethan.io/?sources=https://gist.githubusercontent.com/htmldoug/d292b301fdc3ec66cd7098e4ee54e965/raw/14e54eb2c9b153ba81931d16b276511eb3796f7f/jmh-result-before.json,https://gist.githubusercontent.com/htmldoug/d292b301fdc3ec66cd7098e4ee54e965/raw/14e54eb2c9b153ba81931d16b276511eb3796f7f/jmh-result-optimized.json

Context

I don't use arangodb, but I'm interested in using velocypack to reduce the heap usage of my json processing. I recently saw a 35 MB byte array get parsed into 270 MB of heap by play-json's JsValue. "Case classes ate my RAM" inspired backing the JsValue with a VPackSlice, which drops the heap usage to ~38 MB at the cost of about 33% more CPU time.

JMH + JFR + JMC revealed some low-hanging fruit to cut the extra CPU time roughly in half for my use case. I also benched a real-world example of String => VPackSlice from arangodb/velocypack. The majority of the time is spent on json parsing, although throughput still improved by 20%.

Optimizations

The gains come mostly from reducing redundant UTF-8 encoding via String.getBytes() and from replacing HashMap<Integer, T> lookups keyed by sequential indexes with simple T[] arrays.
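For illustration only (not code from this PR), here is a minimal sketch of both patterns in isolation; class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: lookups keyed by small sequential indexes are cheaper
// through a plain array than through a HashMap<Integer, T> (no boxing, no hashing).
final class IndexedLookup<T> {
    private final Object[] byIndex;

    IndexedLookup(final int size) {
        byIndex = new Object[size];
    }

    void put(final int index, final T value) {
        byIndex[index] = value;            // index is known to be in [0, size)
    }

    @SuppressWarnings("unchecked")
    T get(final int index) {
        return (T) byIndex[index];         // O(1), no Integer boxing, no hashCode()
    }
}

// Hypothetical sketch: encode a repeatedly used string to UTF-8 once and reuse the bytes,
// instead of calling String.getBytes() on every use.
final class CachedUtf8 {
    private final String value;
    private byte[] utf8;                   // lazily encoded on first use

    CachedUtf8(final String value) {
        this.value = value;
    }

    byte[] utf8Bytes() {
        if (utf8 == null) {
            utf8 = value.getBytes(StandardCharsets.UTF_8);
        }
        return utf8;
    }
}
```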

CLA

I've emailed a signed CLA to cla@arangodb.com.

@htmldoug htmldoug changed the title from "Performance Optimizations" to "Improve performance 20%-110%" on Sep 24, 2019
@OmarAyo

OmarAyo commented Sep 25, 2019

CLA Available

* Index of the string bytes within {@link vpack},
* i.e. tag byte and length are somewhere before this index.
*/
private int start;
Contributor Author

@htmldoug htmldoug Sep 25, 2019

I wasn't sure about this decision. Starting at the tag byte would make it more similar to VPackSlice, but it would add a bit more complexity to the compareToBytes operations. Those comparisons are on the hot path for the O(log n) path queries and for sorting the offset table while constructing the object.

I couldn't think of a good way to make this class private, so it's probably worth giving it some thought before merging so we don't break compatibility later.

@mpv1989 @hkernbach What do you think?
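For context, a comparison like the one described above can work directly on the raw UTF-8 bytes in the buffer without building intermediate Strings; the following is a hypothetical sketch, not the code under review.

```java
final class ByteCompare {

    // Hypothetical sketch: compare a key's UTF-8 bytes against the string bytes
    // stored at `start` inside the vpack buffer, treating bytes as unsigned.
    static int compareToBytes(final byte[] vpack, final int start, final int length, final byte[] key) {
        final int common = Math.min(length, key.length);
        for (int i = 0; i < common; i++) {
            final int diff = (vpack[start + i] & 0xFF) - (key[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return length - key.length;        // shorter string sorts first on a shared prefix
    }
}
```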

@rashtao
Collaborator

rashtao commented Sep 26, 2019

@htmldoug Thanks for contributing!

@rashtao rashtao self-assigned this Sep 26, 2019
@htmldoug
Contributor Author

htmldoug commented Oct 4, 2019

@rashtao any idea when you'll get to this or should I fork?

@rashtao
Collaborator

rashtao commented Oct 7, 2019

@htmldoug thanks for the reminder, we will likely review it in the next few weeks.

@htmldoug
Contributor Author

I'll give this another two weeks then probably close and fork.

@siilike
Contributor

siilike commented Nov 15, 2019

It is sad to see that such great (and obvious!) performance improvements take so long to get accepted.

@rashtao
Collaborator

rashtao commented Nov 19, 2019

@htmldoug I am reviewing the PR, but there are some points unclear to me:

  1. the code has many warnings, e.g. unused imports and Javadoc warnings
  2. the code is not compatible with Java 1.6, which is currently the project's base Java version
  3. using -XX:+UnlockCommercialFeatures and JmhFlightRecorderProfiler is not compatible with our license and cannot be used with open-source software, at least until we move to JDK 11
  4. running the benchmark I see no improvement for Bench.fromJson, actually it gets slightly worse (0.3% slower). In the other 2 benchmarks I get results similar to yours.

I would respectfully suggest that you:

  1. fix
  2. set the minimum required java version to 1.7 for the project
  3. remove -XX:+UnlockCommercialFeatures and JmhFlightRecorderProfiler
  4. you can leave Bench.fromJson as is, so it will help us in the future

Also, raising the minimum required Java version to 7 would enable us to use ByteBuffers, avoiding Arrays.copy every time we have to resize the buffers, which imho would yield considerable performance improvements.

Thanks for contributing and sorry for the late reply!

siilike added a commit to siilike/java-velocypack that referenced this pull request Nov 19, 2019
@siilike siilike mentioned this pull request Nov 19, 2019
@htmldoug
Contributor Author

htmldoug commented Nov 30, 2019

the code has many warnings, e.g. unused imports and Javadoc warnings

I added the -Werror compiler arg and fixed the warnings. I couldn't find any Javadoc warnings. If they're still there, I'll need some guidance to repro.

set the minimum required java version to 1.7 for the project

done!

remove -XX:+UnlockCommercialFeatures and JmhFlightRecorderProfiler

Done. Replaced with: if (java.version >= 11) then jvmArgs += -XX:StartFlightRecording.

running the benchmark I see no improvement for Bench.fromJson, actually it gets slightly worse (0.3% slower). In the other 2 benchmarks I get results similar to yours.

Good observation. Looks like the results of that benchmark vary significantly between JVM forks. With @Fork(5), the results even out:

Before:
Benchmark                                    Mode  Cnt        Score       Error   Units
Bench.fromJson                               avgt  150        6.260 ±     0.089   ms/op
Bench.fromJson:·gc.alloc.rate                avgt  150      730.280 ±    11.476  MB/sec
Bench.fromJson:·gc.alloc.rate.norm           avgt  150  7171474.987 ± 16830.651    B/op
Bench.fromJson:·gc.churn.G1_Eden_Space       avgt  150      440.880 ±     9.614  MB/sec
Bench.fromJson:·gc.churn.G1_Eden_Space.norm  avgt  150  4329014.558 ± 62921.591    B/op
Bench.fromJson:·gc.churn.G1_Old_Gen          avgt  150        0.007 ±     0.010  MB/sec
Bench.fromJson:·gc.churn.G1_Old_Gen.norm     avgt  150       67.622 ±    94.356    B/op
Bench.fromJson:·gc.count                     avgt  150     1201.000              counts
Bench.fromJson:·gc.time                      avgt  150      693.000                  ms
Benchmark result is saved to target/jmh-result/2019-11-29_20-00-26.json

After:
Benchmark                                    Mode  Cnt        Score       Error   Units
Bench.fromJson                               avgt  150        6.237 ±     0.121   ms/op
Bench.fromJson:·gc.alloc.rate                avgt  150      704.050 ±    15.752  MB/sec
Bench.fromJson:·gc.alloc.rate.norm           avgt  150  6868573.344 ± 11200.920    B/op
Bench.fromJson:·gc.churn.G1_Eden_Space       avgt  150      413.042 ±    11.590  MB/sec
Bench.fromJson:·gc.churn.G1_Eden_Space.norm  avgt  150  4029680.844 ± 68520.798    B/op
Bench.fromJson:·gc.count                     avgt  150      917.000              counts
Bench.fromJson:·gc.time                      avgt  150      614.000                  ms
Benchmark result is saved to target/jmh-result/2019-11-29_20-06-51.json

alloc.rate.norm seems to have improved, but throughput is within the margin of error. Jackson dominates CPU time on this benchmark, so I guess the lack of improvement there shouldn't be surprising:

[Screenshot: profiler view from 2019-11-29 showing json parsing (Jackson) dominating CPU time]
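For reference, the fork count mentioned above is configured with JMH's @Fork annotation; a minimal, hypothetical benchmark skeleton (not the actual Bench class) might look like this.

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;

// Running 5 separate JVM forks averages out fork-to-fork JIT and GC variance.
@Fork(5)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class FromJsonBench {

    @Benchmark
    public Object fromJson() {
        // parse a fixed json document here; the body is a placeholder
        return new Object();
    }
}
```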

@htmldoug
Contributor Author

I think I've figured out and fixed the javadoc warnings.

@rashtao rashtao self-requested a review December 2, 2019 11:48
Collaborator

@rashtao rashtao left a comment

LGTM

@rashtao rashtao merged commit 6ba8cc9 into arangodb:master Dec 2, 2019
@htmldoug
Contributor Author

htmldoug commented Jan 8, 2020

Also, raising the minimum required Java version to 7 would enable us to use ByteBuffers, avoiding Arrays.copy every time we have to resize the buffers, which imho would yield considerable performance improvements.

Hey @rashtao, how would this work? I only have a light understanding of ByteBuffers, but my impression was that HeapByteBuffer and DirectByteBuffer were fixed capacity, i.e. not resizable. Are you picturing composing these into a composite like akka's ByteString? How were you envisioning this working?

@rashtao
Collaborator

rashtao commented Jan 9, 2020

@htmldoug
I mean DirectByteBuffer would have higher write throughput than byte[] and no impact on GC. I have no experience with akka's ByteString, but I think composing could also lead to additional performance improvements, e.g. something like Netty's CompositeByteBuf.
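A minimal sketch of what writing through a direct buffer looks like (illustrative only; as discussed below, the capacity is fixed at allocation time):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class DirectBufferSketch {
    public static void main(final String[] args) {
        // Direct buffers are allocated off-heap, so their contents are not scanned by the GC;
        // the capacity, however, is fixed when the buffer is created.
        final ByteBuffer buffer = ByteBuffer.allocateDirect(1024);
        buffer.put((byte) 0x0b);                              // e.g. a tag byte
        buffer.put("key".getBytes(StandardCharsets.UTF_8));   // raw UTF-8 bytes
        buffer.flip();                                        // switch from writing to reading
        System.out.println("first byte: " + buffer.get());
    }
}
```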

@htmldoug
Contributor Author

htmldoug commented Jan 9, 2020

ensureCapacity() is producing a ton of non-TLAB allocations. Growing the array by 1.5x rather than the more typical 2x is causing a lot of churn.

I'm still not entirely sure how DirectByteBuffer would help, but a composite (at least within the builder) would be really good here. I started exploring that direction, but one barrier to implementing it is that VPackBuilder.remove() does deletions via some left-shifting of the array that I haven't quite figured out yet, and I ran out of time.
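To make the growth-factor point concrete, here is a rough sketch of exponential growth in an ensureCapacity-style helper (illustrative, not the actual VPackBuilder code):

```java
import java.util.Arrays;

final class GrowthSketch {

    // Doubling roughly halves the number of reallocations (and large copies)
    // compared to 1.5x growth; 1.5x would be newSize + (newSize >> 1).
    static byte[] ensureCapacity(final byte[] buffer, final int required) {
        if (required <= buffer.length) {
            return buffer;
        }
        int newSize = Math.max(8, buffer.length);
        while (newSize < required) {
            newSize *= 2;
        }
        return Arrays.copyOf(buffer, newSize);
    }
}
```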

@siilike
Contributor

siilike commented Jan 11, 2020

Why not just implement an OutputStream that stores everything in a List<byte[]>? Then we could have a toByteArray() method that copies the whole content to a single byte array (e.g. for Slice) and write* methods that are optimized to use the original byte arrays directly.

The only issue is the Builder.remove() method that needs some more attention.

For a more complex solution Netty has CompositeByteBuf which works similarly, but uses ByteBuffers for storage.
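A bare-bones sketch of the List<byte[]>-backed OutputStream idea (chunk size, class name, and the lack of bounds/IO handling are all illustrative):

```java
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: accumulate fixed-size chunks in a List<byte[]> and only
// copy into a single array on demand.
final class ChunkedOutputStream extends OutputStream {
    private static final int CHUNK_SIZE = 8192;
    private final List<byte[]> chunks = new ArrayList<byte[]>();
    private byte[] current = new byte[CHUNK_SIZE];
    private int pos = 0;

    @Override
    public void write(final int b) {
        if (pos == current.length) {
            chunks.add(current);               // chunk is full, start a new one
            current = new byte[CHUNK_SIZE];
            pos = 0;
        }
        current[pos++] = (byte) b;
    }

    byte[] toByteArray() {
        final byte[] out = new byte[chunks.size() * CHUNK_SIZE + pos];
        int offset = 0;
        for (final byte[] chunk : chunks) {    // all stored chunks are full
            System.arraycopy(chunk, 0, out, offset, CHUNK_SIZE);
            offset += CHUNK_SIZE;
        }
        System.arraycopy(current, 0, out, offset, pos);
        return out;
    }
}
```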

@htmldoug
Contributor Author

htmldoug commented Jan 11, 2020

@siilike, yeah, that'd be an improvement and there's good precedent for that.

Dreaming about the ideal state for a moment, I'd love the option to back both VPackBuilder and VPackSlice entirely on disk for situations where I'm dealing with hundreds-of-megabytes json files that my apps sometimes receive. I haven't checked, but I'd be surprised if arangodb isn't already doing something similar.

This would require decoupling VPackBuilder/Slice from their backing implementation, which could be done by identifying all the operations currently used (appendByte/s(...), readByte/s(...), even leftShift()?), and stuffing them into some pluggable interface VPackStore {...}. That'd enable your ListByteVPackStore, the current baseline CompactArrayByteVPackStore, @rashtao's DirectByteBufferVPackStore, or a kafka-style RandomAccessFileVPackStore for my stupidly large inputs.
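A rough sketch of what that pluggable interface might look like, using the hypothetical names from the comment above:

```java
// Hypothetical interface; method names mirror the operations listed above.
interface VPackStore {

    void appendByte(byte b);

    void appendBytes(byte[] bytes, int offset, int length);

    byte readByte(long index);

    void readBytes(long index, byte[] target, int offset, int length);

    // Shift `length` bytes starting at `index` left by `amount` (needed by VPackBuilder.remove()).
    void leftShift(long index, int length, int amount);

    long size();
}
```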

@rashtao
Collaborator

rashtao commented Jan 13, 2020

I would suggest encapsulating all the buffer operations in one single class, call it e.g. VPackSliceBuffer, and implementing there all the methods for manipulating the buffer. The first implementation can simply delegate to the underlying byte array operations. Once we have it, we can test and benchmark different implementations. We can even think about offering different buffer implementations and letting the user choose which one to use.

Also I would move the discussion to https://arangodb-community.slack.com/archives/C5T51CW2J so we can get feedback and opinions from the community.

@siilike
Contributor

siilike commented Jan 13, 2020

"Don't have an account on this workspace yet?
Contact the workspace administrator for an invitation"

@rashtao
Collaborator

rashtao commented Jan 14, 2020

Sorry @siilike, you can register here: https://slack.arangodb.com/
